data <- read.csv("insurance.csv")
head(data)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
Predictor variables:
Categorical : sex, smoker, region
Continuous : age, bmi, children
Response variable : charges
lin_model <- lm(charges~., data = data)
summary(lin_model)
##
## Call:
## lm(formula = charges ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## sexmale -131.3 332.9 -0.394 0.693348
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
As observed, sex predictor has a high p-value. So we cannot reject the null hypothesis. This is not a significant predictor.
lin_model <- lm(charges~.-sex, data = data)
summary(lin_model)
##
## Call:
## lm(formula = charges ~ . - sex, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11367.2 -2835.4 -979.7 1361.9 29935.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11990.27 978.76 -12.250 < 2e-16 ***
## age 256.97 11.89 21.610 < 2e-16 ***
## bmi 338.66 28.56 11.858 < 2e-16 ***
## children 474.57 137.74 3.445 0.000588 ***
## smokeryes 23836.30 411.86 57.875 < 2e-16 ***
## regionnorthwest -352.18 476.12 -0.740 0.459618
## regionsoutheast -1034.36 478.54 -2.162 0.030834 *
## regionsouthwest -959.37 477.78 -2.008 0.044846 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
## F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
As observed, region predictor has a high p-value. So we cannot reject the null hypothesis. This is not a significant predictor. It is only significant for particular values of alpha, hence we decide to remove it.
lin_model2 <- lm(charges ~ (.-sex-region), data = data)
summary(lin_model2)
##
## Call:
## lm(formula = charges ~ (. - sex - region), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11897.9 -2920.8 -986.6 1392.2 29509.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12102.77 941.98 -12.848 < 2e-16 ***
## age 257.85 11.90 21.675 < 2e-16 ***
## bmi 321.85 27.38 11.756 < 2e-16 ***
## children 473.50 137.79 3.436 0.000608 ***
## smokeryes 23811.40 411.22 57.904 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6068 on 1333 degrees of freedom
## Multiple R-squared: 0.7497, Adjusted R-squared: 0.7489
## F-statistic: 998.1 on 4 and 1333 DF, p-value: < 2.2e-16
We observe that all predictors have less p-values and for all the null hypothesis can be rejected. All these predictors are significant for the model. We also observe that with these predictors the model can predict 74.97% of outputs.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
dwtest(lin_model2)
##
## Durbin-Watson test
##
## data: lin_model2
## DW = 2.0874, p-value = 0.945
## alternative hypothesis: true autocorrelation is greater than 0
DW=2 suggests no autocorrelation
plot(lin_model2)
From the residual vs fitted graph, we observe that the data shows pattern and is not random, hinting at the presence of heteroscedasticity. From the Q-Q plot, we observe that the data is skewed and not normal. From residuals vs leverage plot, we observe that there are no points that are beyond cook’s distance. So, the model does not have influential points.
library(lmtest)
bptest(lin_model2)
##
## studentized Breusch-Pagan test
##
## data: lin_model2
## BP = 117.16, df = 4, p-value < 2.2e-16
The Breusch-Pagan (BP) test indicates that there is heteroscedasticity in the data. So, we proceed to transform the data.
log_model <- lm(log(charges) ~ age+bmi+children+smoker,data=data)
summary(log_model)
##
## Call:
## lm(formula = log(charges) ~ age + bmi + children + smoker, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11340 -0.19883 -0.04688 0.07197 2.07581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.9827766 0.0697227 100.151 < 2e-16 ***
## age 0.0347826 0.0008805 39.502 < 2e-16 ***
## bmi 0.0106096 0.0020264 5.236 1.91e-07 ***
## children 0.1011976 0.0101989 9.922 < 2e-16 ***
## smokeryes 1.5432438 0.0304372 50.703 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4491 on 1333 degrees of freedom
## Multiple R-squared: 0.7622, Adjusted R-squared: 0.7614
## F-statistic: 1068 on 4 and 1333 DF, p-value: < 2.2e-16
plot(log_model)
bptest(log_model)
##
## studentized Breusch-Pagan test
##
## data: log_model
## BP = 80.917, df = 4, p-value < 2.2e-16
This is indicating that even with the log-transform the heteroscedasticity is still present, so we use square transform.
model_square <- lm((charges^2) ~ age+bmi+children+smoker,data=data)
summary(model_square)
##
## Call:
## lm(formula = (charges^2) ~ age + bmi + children + smoker, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -707112670 -144270762 -18280529 120504755 2730241564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -793695547 52779464 -15.038 <2e-16 ***
## age 6834317 666556 10.253 <2e-16 ***
## bmi 20298532 1533972 13.233 <2e-16 ***
## children 8590667 7720482 1.113 0.266
## smokeryes 1057514060 23040686 45.898 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.4e+08 on 1333 degrees of freedom
## Multiple R-squared: 0.6435, Adjusted R-squared: 0.6425
## F-statistic: 601.6 on 4 and 1333 DF, p-value: < 2.2e-16
plot(model_square)
bptest(model_square)
##
## studentized Breusch-Pagan test
##
## data: model_square
## BP = 218.78, df = 4, p-value < 2.2e-16
This also did not help much with the heteroscedasticity.
model_square_root <- lm((charges^0.5) ~ age+bmi+children+smoker,data=data)
summary(model_square_root)
##
## Call:
## lm(formula = (charges^0.5) ~ age + bmi + children + smoker, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.836 -11.452 -4.541 3.376 103.942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.87277 3.50841 -0.249 0.804
## age 1.40495 0.04431 31.709 < 2e-16 ***
## bmi 0.92997 0.10197 9.120 < 2e-16 ***
## children 3.25331 0.51320 6.339 3.15e-10 ***
## smokeryes 90.55542 1.53158 59.125 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.6 on 1333 degrees of freedom
## Multiple R-squared: 0.7769, Adjusted R-squared: 0.7762
## F-statistic: 1160 on 4 and 1333 DF, p-value: < 2.2e-16
plot(model_square_root)
bptest(model_square_root)
##
## studentized Breusch-Pagan test
##
## data: model_square_root
## BP = 29.317, df = 4, p-value = 6.741e-06
Even this does not help with the problem. Another transform that can be used is WLS method.
weights <- 1 / lm(abs(residuals(lin_model2)) ~ fitted(lin_model2))$fitted.values^2
model_wls <- lm(charges ~ age + bmi + children + smoker, data = data, weights = weights)
summary(model_wls)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker, data = data,
## weights = weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -2.0860 -0.7195 -0.4814 -0.0409 13.1996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3934.17 663.76 -5.927 3.92e-09 ***
## age 243.16 10.49 23.183 < 2e-16 ***
## bmi 70.29 22.25 3.158 0.00162 **
## children 645.13 117.82 5.476 5.20e-08 ***
## smokeryes 22725.01 777.53 29.227 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.6 on 1333 degrees of freedom
## Multiple R-squared: 0.5538, Adjusted R-squared: 0.5525
## F-statistic: 413.6 on 4 and 1333 DF, p-value: < 2.2e-16
plot(model_wls)
bptest(model_wls)
##
## studentized Breusch-Pagan test
##
## data: model_wls
## BP = 1.2726e-05, df = 4, p-value = 1
The bp test for the model after transforming with WLS indicates that there is no heteroscedasticity.
library(car)
## Loading required package: carData
vif(model_wls)
## age bmi children smoker
## 1.073255 1.042860 1.033168 1.004907
All values are very close to 1 suggesting that there is no issue of multicollinearity
The low p-value of the F-statistic indicates that there is at least one independent variable that helps predict the target variable.
Age, bmi, children, and smoker are all significant features and together give the model an R-squared value of 0.5538. This means that the features together can predict 55.38% of the outputs.
Our model is charges = -3934.17 + 243.16age + 70.29bmi + 645.13children + 22725.01smoker
For a smoker, charges = 243.16age + 70.29bmi + 645.13children + (22725.01-3934.17) For a non smoker, charges = 243.16age + 70.29bmi + 645.13children - 3934.17
For every observation, charges increases by an average of beta1*w for a unit average increase of age, where w is the weight for that observation and all other predictors are fixed
For every observation, charges increases by an average of beta2*w for a unit average increase of bmi, where w is the weight for that observation and all other predictors are fixed
For every observation, charges increases by an average of beta3*w for a unit average increase of children, where w is the weight for that observation and all other predictors are fixed
For every observation, charges is beta4*w more for a smoker than a non-smoker, where w is the weight for that observation and all other predictors are fixed
model1<-lm(charges~age+bmi+children+smoker+region+sex+age*bmi+age*children+age*smoker+age*region+age*sex+bmi*children+bmi*smoker+bmi*region+bmi*sex+children*smoker+children*region+children*sex+smoker*region+smoker*sex+region*sex,data=data)
summary(model1)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region +
## sex + age * bmi + age * children + age * smoker + age * region +
## age * sex + bmi * children + bmi * smoker + bmi * region +
## bmi * sex + children * smoker + children * region + children *
## sex + smoker * region + smoker * sex + region * sex, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11971.4 -1979.7 -1233.8 -212.7 30056.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.682e+03 2.591e+03 -0.649 0.516186
## age 1.919e+02 5.292e+01 3.626 0.000298 ***
## bmi 4.958e+01 8.392e+01 0.591 0.554788
## children 9.058e+02 6.729e+02 1.346 0.178485
## smokeryes -2.103e+04 1.937e+03 -10.856 < 2e-16 ***
## regionnorthwest -1.611e+03 2.294e+03 -0.702 0.482761
## regionsoutheast 2.941e+03 2.193e+03 1.341 0.180147
## regionsouthwest -4.435e+02 2.191e+03 -0.202 0.839615
## sexmale -1.260e+03 1.581e+03 -0.797 0.425510
## age:bmi 1.168e+00 1.635e+00 0.714 0.475086
## age:children -3.123e+00 8.607e+00 -0.363 0.716839
## age:smokeryes -3.251e+00 2.400e+01 -0.135 0.892263
## age:regionnorthwest 1.975e+01 2.741e+01 0.721 0.471192
## age:regionsoutheast 5.074e+01 2.749e+01 1.846 0.065153 .
## age:regionsouthwest 4.764e+01 2.787e+01 1.709 0.087656 .
## age:sexmale 1.577e+01 1.915e+01 0.823 0.410556
## bmi:children 4.843e-01 1.916e+01 0.025 0.979840
## bmi:smokeryes 1.487e+03 5.649e+01 26.330 < 2e-16 ***
## bmi:regionnorthwest -7.191e+00 7.019e+01 -0.102 0.918420
## bmi:regionsoutheast -1.893e+02 6.113e+01 -3.097 0.001995 **
## bmi:regionsouthwest -8.750e+01 6.727e+01 -1.301 0.193584
## bmi:sexmale 3.576e+00 4.636e+01 0.077 0.938528
## children:smokeryes -3.848e+02 2.878e+02 -1.337 0.181483
## children:regionnorthwest 2.625e+02 3.256e+02 0.806 0.420147
## children:regionsoutheast -2.073e+02 3.229e+02 -0.642 0.520963
## children:regionsouthwest -3.803e+02 3.100e+02 -1.227 0.220035
## children:sexmale -2.456e+02 2.236e+02 -1.098 0.272196
## smokeryes:regionnorthwest -1.610e+02 9.740e+02 -0.165 0.868728
## smokeryes:regionsoutheast -1.145e+03 9.273e+02 -1.235 0.217237
## smokeryes:regionsouthwest 1.016e+03 9.872e+02 1.029 0.303475
## smokeryes:sexmale -2.702e+01 6.792e+02 -0.040 0.968275
## regionnorthwest:sexmale 3.828e+02 7.660e+02 0.500 0.617308
## regionsoutheast:sexmale 6.368e+02 7.705e+02 0.826 0.408682
## regionsouthwest:sexmale 2.631e+02 7.702e+02 0.342 0.732700
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4835 on 1304 degrees of freedom
## Multiple R-squared: 0.8445, Adjusted R-squared: 0.8406
## F-statistic: 214.7 on 33 and 1304 DF, p-value: < 2.2e-16
The only significant synergy is bmi with smoker
interaction_term_model <- lm(charges ~ age+bmi+children+smoker+bmi*smoker, data = data)
summary(interaction_term_model)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + bmi *
## smoker, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14598.6 -1924.4 -1321.4 -465.6 29892.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2729.002 831.270 -3.283 0.00105 **
## age 264.948 9.553 27.735 < 2e-16 ***
## bmi 5.656 24.873 0.227 0.82014
## children 508.924 110.615 4.601 4.61e-06 ***
## smokeryes -20194.709 1654.505 -12.206 < 2e-16 ***
## bmi:smokeryes 1433.788 52.823 27.143 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4871 on 1332 degrees of freedom
## Multiple R-squared: 0.8388, Adjusted R-squared: 0.8382
## F-statistic: 1387 on 5 and 1332 DF, p-value: < 2.2e-16
We observe that the inclusion of synergy term helps the model predict 83.88% of values. This is 28.5% more than the transformed model.
plot(interaction_term_model)
From the residual vs fitted graph, we observe that the data shows pattern and is not random, hinting at the presence of heteroscedasticity. From the Q-Q plot, we observe that the data is skewed and not normal. From residuals vs leverage plot, we observe that there are no points that are beyond cook’s distance. So, the model does not have influential points.
bptest(interaction_term_model)
##
## studentized Breusch-Pagan test
##
## data: interaction_term_model
## BP = 7.0967, df = 5, p-value = 0.2136
BP test suggests that there is no heteroscedasticity in this model.
The low p-value of the F-statistic indicates that there is at least one independent variable that helps predict the target variable.
Age, children, smoker, and synergy term are all significant features and together give the model. bmi is not a significant feature as it has a high p-value but we include it due to the inclusion of the synergy term. R-squared value is 0.8388. This means that the features together can predict 83.88% of the outputs.
Our model is charges = -2729.002 + 264.948age + 5.656bmi + 508.924children - 20194.709smoker + 1433.788bmi*smoker
For a smoker, charges = -(2729.002 + 20194.709) + 264.948age + (5.656 + 1433.788)bmi + 508.924children
For a non smoker, charges = -2729.002 + 264.948age + 5.656bmi + 508.924children
Charges increases by an average of beta1 for a unit average increase of age, where all other predictors are fixed
Charges increases by an average of beta2 for a unit average increase of bmi, where all other predictors are fixed
Charges increases by an average of beta3 for a unit average increase of children, where all other predictors are fixed
In case if the person is smoker, the charges decreases by an average of 20194.709 units with an average increase of 1433.788 for unit increase in bmi. This is the change over the base case of non-smoker.